Tagging a Norwegian Speech Corpus
Abstract
This paper describes work on the grammatical tagging of a newly created Norwegian speech corpus: the first corpus of modern Norwegian speech. We use an iterative procedure to perform computer-aided manual tagging of a part of the corpus. This material is then used to train the final taggers, which are applied to the rest of the corpus. We experiment with taggers that are based on three different data-driven methods: memory-based learning, decision trees, and hidden Markov models, and find that the decision tree tagger performs best. We also test the effects of removing pauses and/or hesitations from the material before training and applying the taggers. We conclude that these attempts at cleaning up hurt the performance of the taggers, indicating that such material, rather than functioning as noise, actually contributes important information about the grammatical function of the words in their nearest context.
Similar resources
Part-of-Speech Tagging of Persian Using a Fuzzy Network Model
Part-of-speech tagging (POS tagging) is an ongoing area of research in natural language processing (NLP) applications. The process of classifying words into their parts of speech and labeling them accordingly is known as part-of-speech tagging, POS tagging, or simply tagging. Parts of speech are also known as word classes or lexical categories. The purpose of POS tagging is determining the grammatical ...
Universal Dependencies for Norwegian
This article describes the conversion of the Norwegian Dependency Treebank to the Universal Dependencies scheme. This paper details the mapping of PoS tags, morphological features and dependency relations and provides a description of the structural changes made to NDT analyses in order to make it compliant with the UD guidelines. We further present PoS tagging and dependency parsing experiment...
An open source part-of-speech tagger for Norwegian: Building on existing language resources
This paper presents an open source part-of-speech tagger for the Norwegian language. It describes how an existing language processing library was used to build a new part-of-speech tagger for this language. This part-of-speech tagger has been built on already available resources, in particular a Norwegian dictionary and gold standard corpus, which were partly customized for the purposes of this...
A Web-based Advanced and User Friendly System: The Oslo Corpus of Tagged Norwegian Texts
A general purpose text corpus meant for linguists and lexicographers needs to satisfy quality criteria on at least four different levels. The first two criteria are fairly well established; the corpus should have a wide variety of texts and be tagged according to a fine-grained system. The last two criteria are much less widely appreciated, unfortunately. One has to do with variety of search cri...
RUNDKAST: an Annotated Norwegian Broadcast News Speech Corpus
This paper describes the Norwegian broadcast news speech corpus RUNDKAST. The corpus contains recordings of approximately 77 hours of broadcast news shows from the Norwegian broadcasting company NRK. The corpus covers both read and spontaneous speech as well as spontaneous dialogues and multipart discussions, including frequent occurrences of non-speech material (e.g. music, jingles). The recor...